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Abstract: Graphical models are widely used to model stochastic dependences among 
large collections of variables. We introduce a new method of estimating undirected con¬ 
ditional independence graphs based on the score matching loss, introduced by Hyvarinen 
(2005), and subsequently extended in Hyvarinen (2007). The regularized score match¬ 
ing method we propose applies to settings with continuous observations and allows for 
computationally efficient treatment of possibly non-Gaussian exponential family mod¬ 
els. In the well-explored Gaussian setting, regularized score matching avoids issues of 
asymmetry that arise when applying the technique of neighborhood selection, and com¬ 
pared to existing methods that directly yield symmetric estimates, the score matching 
approach has the advantage that the considered loss is quadratic and gives piecewise lin¬ 
ear solution paths under £i regularization. Under suitable irrepresentability conditions, 
we show that -regularized score matching is consistent for graph estimation in sparse 
high-dimensional settings. Through numerical experiments and an application to RNAseq 
data, we confirm that regularized score matching achieves state-of-the-art performance 
in the Gaussian case and provides a valuable tool for computationally efficient estimation 
in non-Gaussian graphical models. 
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1. Introduction 

Undirected graphical models^ also known as Markov random fields, are important tools for sum¬ 
marizing dependency relationships between random variables and have found application in 
many fields, including bioinformatics, language and speech processing, and digital communi¬ 
cations. Each such model is associated to an undirected graph G = {V,E), with vertex set V 
and edge set E C V x V. For a random vector X = (Xj : j € V) indexed by the nodes of G, 
the graphical model given by G requires that X, and X^ be conditionally independent given 
all other variables whenever nodes j and k are not joined by an edge in G (Lauritzen, 1996). 
If G is the smallest graph such that X satishes this requirement, we term G the conditional 
independence graph of X. In this case, Xj and X^ are conditionally independent given all other 
variables if and only if j and k are non-adjacent in G. We will always take the vertex set to be 
V = {!,..., m}, so m is the number of observed variables in X. 

Specihc models are obtained from additional distributional assumptions. Particularly, an 
assumption of multivariate normality gives Gaussian graphical models, for which estimation 
of conditional independence graphs is equivalent to covariance selection (Dempster, 1972). If 
X is jointly multivariate normal with mean vector p, and covariance matrix S—in symbols. 
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X ^ N{p,,Yl )—then the conditional independences among the random variables, and hence 
edges between nodes in the graph, are determined by the entries of the inverse covariance, or 
concentration matrix K = {Kjk) = More precisely, Kjk = 0 for j 7 ^ fc if and only if Xj 

and Xk are independent given all other variables. 

There is a large literature on selection of conditional independence graphs; see the refer¬ 
ences in Edwards (2000, Chap. 6 ) or Drton and Perlman (2007). In the last decade, attention 
has shifted to high-dimensional settings with the number of variables m comparable to or 
larger than the sample size n. This scenario arises, for instance, in microarray experiments. 
Fortunately, high-dimensional problems may remain tractable in the presence of structural 
constraints such as sparsity, i.e., if each node in the graph is incident to a small number of 
edges. This is of interest for microarray data as gene regulatory networks are intrinsically sparse 
(Leclerc, 2008). 

Gaussian models have been the primary tool for graphical modeling of data comprising 
continuous variables, such as gene expression data, and a large number of methods have been 
proposed for statistical estimation in high-dimensional Gaussian graphical models. A common 
strategy involves augmenting a loss function with a sparsity-inducing penalty such as an £ 1 , 
or lasso penalty. Two widely-used approaches are the graphical lasso or glasso (Yuan and Lin, 
2007) and neighborhood selection (Meinshausen and Biihlmann, 2006). In glasso, an £1 penalty 
on the entries of the inverse covariance matrix is added to the negative Gaussian log-likelihood. 
Neighborhood selection, on the other hand, is an £i-penalized pseudo-likelihood approach that 
leverages the fact that the node-wise full conditional distributions from a Gaussian graphical 
model form m linear regression models. Meinshausen and Biihlmann (2006) treat these separate 
regression models as having their parameters unrelated, but as we discuss below, methods that 
account for the symmetry in a concentration matrix have been proposed in subsequent work. 

Methods for high-dimensional data have also been developed for non-Gaussian settings. 
Miyamura and Kano (2006), Finegold and Drton (2011), Vogel and Fried (2011) and Sun and 
Li (2012) address robustness to outliers. Liu, Lafferty and Wasserman (2009), Liu et al. (2012) 
and Dobra and Lenkoski (2011) treat Gaussian copula models. Neighborhood selection/pseudo¬ 
likelihood procedures can also be applied to models for categorical models where the node-wise 
regression is logistic or multinomial (Lee, Ganapathi and Roller, 2007; Hofling and Tibshirani, 
2009; Ravikumar, Wainwright and Lafferty, 2010; Jalali et al., 2011). Allen and Liu (2013) 
and Yang et al. (2012) discuss extensions using node-wise generalized linear models, and semi- 
/nonparametric methods were proposed by Fellinghauer et al. (2013) and Voorman, Shojaie 
and Witten (2014). 

In this paper, we propose a different approach to high-dimensional graphical model selection. 
Addressing the case of continuous but not necessarily Gaussian observations, the proposed 
method is based on the score matching loss, first introduced by Hyvarinen (2005) in the setting 
of image analysis. Recently, Forbes and Lauritzen (2015) studied score matching in Gaussian 
graphical models with symmetry constraints, and demonstrated that, when the number of 
variables m is fixed, the estimators derived from the score matching loss are asymptotically 
efficient in some special cases, but not in general. Our focus is instead on the use of score 
matching in high-dimensional problems, for which we consider regularization with an £1 penalty. 
We will refer to this graphical model selection technique as regularized score matching. 

Regularized score matching is computationally very convenient for any exponential family 
comprising continuous distributions. Indeed, the score matching loss is a positive semi-definite 
quadratic function. It follows that the solution path for the regularized score matching problem 
is piecewise linear and can be computed in entirety. Moreover, theoretical analysis can be based 
on familiar techniques. Most importantly, as we demonstrate for Gaussian graphical models, 
regularized score matching exhibits state-of-the-art statistical efficiency in high-dimensional 
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settings. The method also performs well in our applications to non-Gaussian models, which 
include models that seem rather difficult to handle via other methods. 

In the Gaussian setting, regularized score matching is structurally closest to pseudo-likelihood 
methods with symmetry constraints, such as SPACE (Peng et ah, 2009), symmetric lasso 
(Friedman, Hastie and Tibshirani, 2010) and SPLICE (Rocha, Zhao and Yu, 2008). A thor¬ 
ough discussion of these different methods is given by Khare, Oh and Rajaratnam (2015) who 
also reformulate the SPACE objective function to ensure convergence of coordinate descent 
algorithms. They abbreviate their method as CONCORD. For brevity, we refer to these al¬ 
gorithms collectively as SPACE. We note that in contrast to regularized score matching, the 
SPACE methods do not have piecewise linear solution paths. Furthermore, as remarked be¬ 
fore, the computational convenience of regularized score matching carries over to non-Gaussian 
settings. 

A limitation of the original score matching introduced by Hyvarinen (2005) is that it re¬ 
quires the data to be generated from a distribution whose density is twice differentiable on 
K™. Hyvarinen (2007) proposed a generalization of the approach to the important case of 
non-negative data. For exponential families, the non-negative score matching loss is again a 
semidefinite quadratic function. We explore regularization of the non-negative score match¬ 
ing loss as a tool for estimation of conditional independence graphs from high-dimensional 
non-negative data, and we establish consistency of the method. 

The remainder of the paper is organized as follows. Section 2 provides the needed back¬ 
ground on score matching and its applications. In Section 3, we describe the proposed method, 
regularized score matching. Implementation details are given in Appendix A. In Section 4, we 
present results of numerical experiments to compare the performance of the procedure with 
existing approaches. An application to RNAseq data is given in Section 5. Section 6 provides 
sparsistency theory for both basic and non-negative regularized score matching. Proofs are 
given in Section 7 with details deferred to Appendix B and C. We end with a discussion in 
Section 8. 


Notation 

The following notational conventions are used throughout the paper: 

(i) Random variables/vectors are denoted by upper case letters; lower case letters are used 

for observed values. So, x S is an observed value of the random vector X. Similarly, 
X = {xij) G is a matrix of observed values, which will typically hold the realizations 

of n i.i.d. copies of X in its rows. We index the columns of a matrix with subscripts, so 
Xj refers to the _)th column of x. Superscripts in parentheses are used to refer to the rows 
of a matrix, so x^‘^ is the ith row of x. 

(ii) For a matrix U = (uy) £ we denote the vectorization obtained by stacking 

columns by 

vec(U) = (uii,M 21 

(hi) Let a,b G [1, oo]. We denote the ia norm of a vector u G K™ by 

y m 

ikiu = (1^*1 

and write |||U|||^^ = max|| 3 ;||^^i ||Ux ||5 for the ia/^b operator norm of a matrix U G 
M-x™. We let IIIUIIU = ll|U||L,,, and ||U|U = ||vec(U)|U. 
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2. Score Matching 


We begin with an overview of Hyvarinen’s score matching, discussing first random vectors 
supported on all of and then random vectors supported on the nonnegative orthant. We 
also review the convenient form of the score matching estimating equations in exponential 
families. 

2.1. Basic score matching 

Suppose X is a continuous random vector taking values in K"*, with joint distribution P. 
Suppose further that P belongs to the family V that comprises all probability distributions 
with support equal to and a twice differentiable density with respect to Lebesgue measure. 
We emphasize that in a statistical context the differentiability requirement is with respect to 
data. We write p to denote the density of P and adopt the usual notation for the gradient and 
Laplacian 

V/(x) = |^/(*)| e K™, A/(x) = ^ e K, 


of a function / : K™ —>■ K. 

For a distribution Q G V with density q, define the divergence function 

J{Q)=[ p(a:) [||Vlogg(x) - Vlogp(a:)||^] da; (2.1) 

Jr™ 

as the expected squared distance between the gradients of the log-densities of the two distri¬ 
butions Q and P. By choosing Q to minimize (2.1), we are matching ‘scores’ with respect to 
the data vector x. Hence, (2.1) has been referred to as the score matching loss. It is evident 
from (2.1) that the score matching loss is uniquely minimized when Q = P. 

Upon initial inspection, optimization of J{Q) seems to require knowledge of P in an impor¬ 
tant way. However, Hyvarinen (2005) showed that, under mild regularity conditions, the score 
matching loss (2.1) can be rewritten as: 


J{Q) = 


p{x) 


A\ogq{x) + -||Vlogg(x)||( 


dx 


const, 


( 2 . 2 ) 


where ‘const’ refers to a term independent of Q. The key term in the integrand in (2.2) is the 
so-called Hyvarinen scoring rule 

S{x,Q) = Alogg(x) -b ^||Vlog( 7 (a:)|| 2 . 

The integral in (2.2) admits an empirical version in which the integration with respect to 
P is replaced by an average over an observed sample, which we arrange into a data matrix 
X S This leads to the empirical score matching loss 

1 ” 

J(x,Q) = - V5(a;«,Q), (2.3) 

n 


and the score matching estimator (SME) 


Q = arg min J(x, Q)- 
Q 
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The score matching loss J{Q) was motivated by problems involving models whose distri¬ 
butions have an intractable normalization constant. Indeed, evaluating (2.2) and computing 
the SME Q requires no knowledge of the normalization constant, which is eliminated upon 
taking logarithmic derivatives with respect to x. Besides the imaging problems considered by 
Hyvarinen (2005), score matching has been applied to spatial statistics (Dawid and Musio, 
2013) and neural networks (Koster and Hyvarinen, 2007; Vincent, 2011; Le et ah, 2011). 

The statistical properties of SMEs in classical large sample settings have been investigated 
by Hyvarinen (2005, 2007) and Forbes and Lauritzen (2015). In particular, it has been shown 
that, under the usual regularity conditions, SMEs are asymptotically consistent and normal in 
large-sample theory. However, SMEs are not necessarily asymptotically efficient. 


2.2. Extension to non-negative data 


The partial integration arguments underlying (2.2) may fail to apply when considering dis¬ 
tributions Q that are not supported on all of K”^. In particular, when Q is taken to be from 
7^+, i.e. the family of distributions that are supported on K™ = [ 0 , 00 )™ with Lebesgue den¬ 
sities that are twice differentiable on ( 0 , 00 )*”, then partial integration may not be possible 
due to discontinuities at points with zero coordinates. We thus consider the non-negative score 
matching loss. 


J+{Q) = 



V logq(a;) o x — V logp(x) o x 


dx. 


(2.4) 


as proposed in Hyvarinen (2007). Here, ‘o’ stands for the Hadamard product, that is, element¬ 
wise multiplication. 

The score matching loss (2.1) can be thought of as a function of the Euclidean distance 
between the gradients of the model density q and true density p with respect to a hypothetical 
location parameter /r, evaluated at 0. That is, we may write (2.1) as 


J{Q) 


p(x) 




\ogq{x p)- \7^^ologp{x -f p) 


2 

2 


dx. 


Likewise, the non-negative score matching loss compares the gradient of the model density q 
and true density p with respect to a hypothetical scale parameter a evaluated at 1, 


J+iQ) 


p(x) ||V^=ilogg(xocr) - V^=ilogp(xo(7)| 


dx. 


Under suitably adjusted regularity conditions, Hyvarinen (2007) showed that the non-negative 
score matching loss from (2.4) can be simplified into 


J+((5) = / p{x)S+{x,Q) dx + const 

Jr™ 


(2.5) 


with scoring rule 


S+{x,Q)^Y. 

i=i L 


d\ogq{x) 2^^1059(2;) , 1 2 f d^ogq{x) 

2xj —--ha; --h -a:,- 

OXi 2 


.j 2 ^ \ dxj 


( 2 . 6 ) 


For a data matrix x S one obtains the empirical non-negative score matching loss 

1 " 

J+(x,Q) = -^5+(xW,Q), 
n 


2 = 1 
5 


(2.7) 

















and the non-negative score matching estimator (SME_|_) 

Q+ = arg min J+(x, Q). 

Q 

Again, under the usual regularity conditions, the estimator Q_|_ is asymptotically consistent 
and normal in traditional large-sample theory. 


2.3. Score matching in exponential families 


Hyvarinen (2007) and Forbes and Lauritzen (2015) have shown that the SME has a convenient 
closed form as a rational function of the data when V is an exponential family. Hyvarinen 
(2007) showed the same for SME+ for the example of truncated normal distributions. As they 
provide the basis for our later work, we revisit these results for both SME and SME+. 

Let V = {Qe : 0 G 0) be an exponential family with natural parameter space 0. Suppose 
that the distributions Qg have their common support equal to either X = K™ or A = R™, and 
that V is dominated by Lebesgue measure on K™. Assuming that the sufficient statistics t{x) 
take values in K®, the log-densities of the distributions Qg have the form 

logq{x\9) = 9"'"t{x) — ip{9)-\-h{x), xGX, (2.8) 

and 

0 = je e R® : tP{9) = log J < ooj . (2.9) 

Lemma 1. Let x € R"^™ be a data matrix, and suppose V = {Qg : 9 G Q) is an exponential 
family characterized by (2.8) and (2.9). If V has support X = R™, then the empirical score 
matching loss J{:x.,Qg) is a quadratic function in 9 with 

J(x,Qg) = i 6>^r(x)6»-f g(x)^6»-f c(x), (2.10) 


where r(x) is a positive semidefinite s x s matrix, and g{x.) is an s-vector. The same is true 
for J+(x, Qg) when V has support X = R™. 

Proof. For j = 1,..., m and x G R™, define the s-vectors 


d 


52 

hj3(.x) = ^t{x). 


dx] 


It then follows from (2.8) that j(x, Qg) can be expressed in the claimed form with 

^ n 771 


i=l j=l 
n m 


i=l j=l k 1 / 

c(x) = — ^ - V6(a;^*^) -I- A6(x^*^). 

^ i=l ^ 


( 2 . 11 ) 

( 2 . 12 ) 

(2.13) 
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For non-negative score matching, J+(x, Qg) admits the claimed form with 




i=i i=i 
n m 


i=l 3 = 1 

where the Xij are the entries of the n x m data matrix x. 


(2.14) 


1 _ / f) \ 

, (2.15) 

i=l 3 = 1 i ' 

+ ( 2 - 16 ) 

,_i o-_i \^‘^j / ‘^j 


□ 


Lemma 1 implies that, when working with exponential families, both score matching objec¬ 
tives are quadratic functions of the unknown parameter vector 9. A score matching estimator 
9 thus satisfies a set of linear estimating equations 


0^r(x) -I- 3(x) = 0. 


(2.17) 


2.4- Pairwise interaction models 

The most basic class of exponential families that appear in graphical modeling are pairwise 
interaction models with log-densities 


log(7(a;|6») = ^ 9jktjk{xj,Xk) - ip{9) + b{x), x&XC 


(2.18) 


Here, the tjk are sufficient statistics that depend only on the jth and fcth coordinate of x, and 
the 9jk are interaction parameters. If Qg denotes the distribution with density given by (2.18), 
then the Hammersley-Clifford Theorem implies that an edge between nodes j and k exists in 
the conditional independence graph of Qg if and only if 9jk is nonzero. The specific models 
we consider later either exactly have the form in (2.18) or are closely related extensions with 
log-densities 


logg(a;|6») = 


(2.19) 


a=l j<k 


1^1 j = l 


where pairwise interactions may be of A different types and we also include L sets of sufficient 
statistics t^^'^ depending on the individual coordinates. The latter appear, for instance, when 
allowing distributions to vary in location. The distribution Qg defined by (2.19) has no edge 
between j and k in its conditional independence graph if and only if 9j]^ = ■ ■ ■ = 9jf'^ = 0. 

In our study of score matching methods for models of the type (2.18) or (2.19), it will be 
convenient to introduce the symmetric m x m interaction matrix 0 with entries 


0jfe — 


93 k if j < fc. 


0. 


kj 


if j > k. 


Lemma 2 . Let V to be the pairwise interaction model given by (2.18) with symmetric m x 
m interaction matrix &. If V has support X = M™, then the empirical score matching loss 
J(x.,Qg) equals 

^i'ec(0)'^r(x)wec(0) -I- g{x)'^vec{&) + c(x) 
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( 2 . 20 ) 


for a symmetric mf x matrix r(x) that is block-diagonal, with all blocks of size m x m. 
The same is true for j+(x, Qg) when V has support X = M™. 

Proof. By (2.11) and (2.14), it suffices to show that there exists a block-diagonal matrix rj(a;) 
such that 

6'^hj{x)hj{x)^0 = vec(©)^rj(a;)vec(0), (2-21) 

where 0 = {9jk ■ j < k). Now, 

g Q 

hj{x) 9 = 'g^J'jk{xj,Xk)9jk + 'Q^f'kjixk,Xj)9kj 

k>j ^ k<j ^ 

d d 

— ^ ^ ^ ^jki^^j ■) ^k^^kj ^ ^ ^ ^kj i^^k-) ^j^^kj • 
k>3 k<j 

Define a vector hj{x) € indexed by pairs (kj) with 1 < kj < m, by setting the entries 

to 

{ ^tki{xk,xi) if j = k <1, 

^tikixk,Xj) iij = k>l, (2.22) 

0 if j k. 

Then hj(xy9 = hj{x)vec{&) and (2.21) holds with rj(a;) = hj{x)hj{xy, which is block- 

diagonal as it is zero with the exception of the m x m block indexed by pairs (fc, 1 ) with 
k=j. □ 

Remark 1. When 7^ is a model as specified in (2.19), then the empirical (non-negative) score 
matching loss may still be represented as an explicit quadratic form with a block-diagonal 
symmetric matrix r(x) as in (2.20). However, r(x) is then of size {Arn^ -\-Lm) x {Am? Lm), 
and its m diagonal blocks are of size {Am L) x {Am -|- L). The jth block has its rows and 
columns corresponding to the jth columns of each of 0^^^,..., 0^^^ as well {9^, ■ ■9^^). 

Example 1. If the exponential family is taken to be the family of centered multivariate normal 
distributions with precision matrix K = {Kjk), then the support is T = K™ and 

g(x|K) oc exp | —, a: G M"*. (2.23) 

With 

m 

Vlogg(a;|K) = -Kx, Alogq(x|K) = - 

and dropping a term that is constant in K, the empirical score matching loss from (2.2) takes 
the form 

- tr(K) -h itr(KKW), (2.24) 

where 

W = - 

n ^ 

i=\ 

is the empirical covariance matrix (under knowledge of zero mean). Lemma 2 applies with 
tjk{xj, Xk) = XjXk, in which case the matrix rj(x) constructed in the proof of the lemma does 
not depend on j, other than through the location of the nonzero block. Indeed, (2.20) holds 


with r(x) = Imxm ® W and ^(x) = vec(Imxm)i where Imxm is the m x m identity matrix. 
Clearly, r(x) is positive definite if and only if W is as well. If W is invertible then SME of K 
is K = and coincides with the maximum likelihood estimator. 

Example 2. Consider truncated normal densities of the form 

g(x|K) cx exp | —, x S K™. (2.25) 

Using Kj to denote the jth column of K, it can be shown that the empirical non-negative score 
matching objective is 

-i n m -| 

„ Zm Kj. (2.26) 

i—1 j — 1 


The loss can be written as in (2.10) with r(x) a block diagonal m? x matrix, whose jth 
block is given by 


1 

n 


E 


tXj 2 ^ ^ ^ • 


Moreover, g(x) = 2w + Wdiag, where w = vec(W) and w^i^g = vec(diag(W)). The maximum 
likelihood estimator for K has no closed form due to intractable normalizing constants. 

Example 3. Finally, consider the family of distributions with densities of the form 


g(x|B^^\ B, b) oc exp 






Ea 


jk^j 




■Ea 

i=i 


3-^J 


X e 


(2.27) 


Here, b = (/3i,..., Pm)’^ is an m-vector, and B = (Pjk) and B^^i = are symmetric mxm 

interaction matrices, the latter having a zero diagonal. This family is a class of distributions 
with normal conditionals, with densities that need not be unimodal (Arnold, Castillo and 
Sarabia, 1999; Gelman and Meng, 1991). This family is intriguing from the perspective of 
graphical modeling as, in contrast to the Gaussian case, conditional dependence may also 
express itself in the variances. For conditional independence of Xj and both Pjf^ and P^^p 
need to vanish. 

By Remark 1, the empirical score matching loss for the family from (2.27) can be written 
as a quadratic function with the quadratic term given by block-diagonal matrix r(x) of size 
{2m? + to) X {2m? + to). The blocks are of size {2m -I- 1) x {2m + 1), and the jth block has its 
rows and columns corresponding to the jth columns of B and B^^^ and the jth entry in b. 


3 . Regularized Score Matching 

In this section, we propose the use of regularized score matching for graphical model selection in 
the setting of high-dimensional sparse graphical models. We begin by discussing the proposed 
method and its implementation. Later sections show that, despite the fact that SMEs need 
not be asymptotically efficient in the sense of traditional large-sample theory, regularized score 
matching achieves state-of-the-art statistical performance in high-dimensional problems, all 
the while allowing seemingly complicated non-Gaussian graphical models to be treated in a 
computationally efficient manner. 
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3.1. Methodology 

Building on the ideas underlying methods such as glasso, neighborhood selection and SPACE, 
we augment the score matching loss with a sparsity-promoting penalty. Our focus is on the 
most basic case of an £i penalty but other regularization schemes could be considered instead; 
see also Example 3 below. 

Using the generic representation given in Lemma 1, for an exponential family, the proposed 
method is based on minimizing the objective 

J^(d) = 1 0^r(x)0-5(xfd + c(x) + A||0||i, (3.1) 

where r(x) is positive semidefinite and A > 0 is a tuning parameter that controls the sparsity 
level. Larger values of A yield sparser solutions, and A = 0 gives the nnregularized SME. Since 
r(x) is positive semidefinite, the function J^{9) is convex but in the settings of interest here 
r(x) will be singular and J^{9) will not be strictly convex. 

The regularized score matching objective from (3.1) is similar to the lasso objective in linear 
regression (Tibshirani, 1996), where the function to be minimized takes the special form 

l\\y-X9\\l + \\e\\,, (3.2) 

for a ‘response vector’ y and a ‘design matrix’ X. In the applications we have in mind (3.1) 
cannot be written exactly as in (3.2) because the vector ^(x) is generally not in the column 
span of r(x). However, we may adapt existing optimization methods for lasso to solve the 
regularized score matching problem. Implementation details are given in Appendix A. 

If the considered exponential family is supported on A = K™ and we use the loss from (2.3), 
then we call the minimizer of (3.1) the regularized score matching estimator (rSME). If A = K™ 
and we use the loss from (2.7), then we abbreviate to rSME_|_. In specific instances of graphical 
models, we may apply the £i penalty only to those coordinates of 9 whose vanishing corresponds 
to absence of edges in a conditional independence graph. If the subset £ C {I,..., s} holds the 
relevant coordinates then we use the penalty 

Example 1 (cont.). For the (centered) Gaussian case considered in Example 1, the target of esti¬ 
mation is the symmetric precision matrix K. The conditional independence graph corresponds 
to the pattern of zeros in the off-diagonal entries of K and the rSME is 

K = arg min | - tr(K) + ^tr(KKW)-f A||K||i.„„l, (3.3) 

KGSym„j (2 J 

where W is the empirical covariance matrix and |jK||i „ff = ||K||i£ penalizes only the off- 
diagonal entries indexed hy £ = {{j, k) : j ^ k}. We emphasize that while in this example the 
natural parameter space is the positive definite cone, we propose minimizing simply over the 
entire space of symmetric m x m matrices, denoted by Sym^. As our interest is primarily in 
graph selection, we do not enforce positive definiteness of K, which is in line with methods 
such as SPACE or neighborhood selection; compare Khare, Oh and Rajaratnam (2015). 

We remark that evaluating the function from (3.3) at a nonsymmetric matrix K as well as 
its transpose gives the same value. By convexity, minimizing over all m x m matrices gives 
a solution in Sym^, which then must equal K. 
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Example 2 (cont.). In the truncated normal family from Example 2, the conditional indepen¬ 
dence graph corresponds again to the zero pattern in the off-diagonal entries of the positive def¬ 
inite interaction matrix K. Proceeding in analogy to the Gaussian case, we dehne the rSME+ as 
the minimizer K+ of the objective given by (2.26) with the penalty A||K||i^„ff added on. Again, 
we ignore the positive definiteness requirement and minimize the penalized non-negative score 
matching loss with respect to K G Sym^. 

Example 3 (cont.). For the family of distributions with normal conditionals from Example 3, 
we would like a penalty to induce joint sparsity in the two symmetric interaction matrices B 
and because an edge between nodes j and k is absent from the conditional independence 
graph if and only both B and B^^^ have their {j, k) entries zero. For this purpose, it is natural to 
adopt the group lasso penalty (Yuan and Lin, 2006). The rSME is then obtained by minimizing 
the empirical score matching loss augmented by the penalty 

Ignoring again any rehned constraints from the natural parameter space of the family, we 
propose minimizing the penalized loss with respect to b G K™ and B,B(^) G Sym^. Since 
the group lasso is applied with small groups (of size 2), the problem would be suitable for 
application of exact block-coordinate descent as discussed in Foygel and Drton (2010a). 

3.2. Uniqueness of rSME 

In the setup from Lemma 1, we may write 

r(x) = H(x)^H(x) (3.4) 

for an nmx s matrix iT(x); recall (2.11) and (2.14). Based on the arguments leading to Lemmas 
3 and 5 in Tibshirani (2013), the function J^{0) from (3.1) has a unique minimizer 9 as long 
as A > 0 and the columns of H (x) are in general position. To clarify, suppose that U C K""* 
is a collection of \hl\ = s vectors. Then lA is in general position if for all k < minjnTO, s}, 
all choices of vectors Ui,... ,Uk+i G U and signs CTi, ..., G {—1,1}, the affine span of 
CTiUi,..., ak+iUk+i does not contain any vector u or —u for u GU \ (mi, ..., Uk+i}. 

The graphical models we are interested in are pairwise interaction models that have addi¬ 
tional special structure in that the matrix r(x) is block-diagonal with m blocks of equal size; 
recall Lemma 2 and Remark 1. Denote the diagonal blocks by ri(x),..., rm(x), which in the 
setup from (2.19) are of size [Am? + Lm) x [Am? + Lm). Each block is the sum of n symmetric 
rank one matrices and we have the decomposition 

TjW = Hj(x)^Hj(x), j = l,...,m. (3.5) 

The n columns of each of the matrices Hj(x) were specified in (2.22). It now holds that the 
regularized score matching problem from (3.1) has a unique minimizer provided each one of 
the n X [Am + L) blocks Hi(x),..., Hm(x) defined in (3.5) has its columns in general position. 
Example 1 (cont.). In the Gaussian case, Hi(x) = ••• = H,„(x) = x. By the Lemma in 
Okamoto (1973), the set of matrices x that fail to be in general position has measure zero. The 
rSME K is unique almost surely when data are generated from a continuous joint distribution. 
Example 2 (cont.). In the truncated normal case, Hj(x) is equal to the matrix obtained from 
X by multiplying each column element-wise with Xj^ the jth column of x. The Lemma in 
Okamoto (1973) implies that the rSME_|_ is unique almost surely. 

For the normal conditionals model from Example 3, almost sure uniqueness would have to 
be derived by appealing to results on uniqueness of group lasso (Roth and Fischer, 2008). 
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3.3. Piecewise linear paths 


The rSME depends on the regularization parameter A. In this section we make this explicit and 
denote it by 9^. Adopting standard language, we refer to the set of 9^ obtained by varying A as 
the solution path and call this path piecewise linear if there exists 0 = Aq < Ai < ... < A/j = oo 
and ^ 0 ) ■ ■ •) ^R-i S K™ such that 9^ = + (A — Xr)^r for A G [A^, Ar+i]. Piecewise linear 

solution paths have the appeal that the entire solution path can be found by calculating the 
change points Xr and associated slopes ^r- 

The next lemma is a consequence of the quadratic nature of the score matching objective 
for exponential families, and holds for the lasso problem as well. 

Lemma 3 . The solution path 9^ for the regularized score matching problem from (3.1) is 
piecewise linear. 


Proof. An s-vector z belongs to 9||0||i, the subdifferential of the £i norm, if 

fsign(6»j) ii9jy£0, 


z, = 


|g[-1,1] if9j=0. 

The Karush-Kuhn-Tucker (KKT) conditions characterizing optimality in (3.1) are 

r(x)0 - g(x) + Az = 0, z G d\\9\\i. 

The linear relationship between 9 and A (for “fixed” z) implies the claim. 


(3.6) 


(3.7) 

□ 


While straightforward to show, the property of piecewise linear paths is special to the 
score matching method we propose. Other methods that give symmetric estimates of precision 
matrices in Gaussian graphical models, such as glasso or the SPACE-type methods discussed 
in Khare, Oh and Rajaratnam (2015) do not have piecewise linear solution paths. This said, 
piecewise linear paths also arise in neighborhood selection (Meinshausen and Biihlmann, 2006) , 
which, however, is a formulation without symmetry. Note also that when using a group lasso 
penalty as suggested for Example 3, rSME solution paths are no longer piecewise linear. 
Example 1 (cont.). In the Gaussian model, the KKT conditions state that K is a solution to 
(3.1) if and only if 

(Imxm G W) vec(K) - vec (Imxm) + Az = 0 (3.8) 

for z G 9||K||i^off, which in slight abuse of notation, we take to mean that 

f 0 a j = k, 

Zjk = S sign(%fc) if K]k 0 and f ^ fc, (3.9) 

[g [—1,1] if kjk = 0 and j ^ k. 

The first case accounts for the fact that the objective is smooth in the diagonal entries of the 
precision matrix, which are not penalized. Combining (3.8) and (3.9), we have that 

771 

-1 ^^^Wjkkjk = 0 , 

k^l 

m 771 

Wjikik + Wktktj + Xzjk = 0, 

1=1 i=l 

A Gaussian solution path is shown in Figure lb, with the horizontal axis transformed to 
— 12j^k \^jk\- data were drawn from a multivariate normal distribution with the 
conditional independence graph from Figure la, with sample size n = 12. We note that, as one 
would hope, the coefficient that last enters the solution corresponds to the absent edge (1,4). 


j = l,...,m, 

(3.10) 

1 < j ^ k < m. 

(3.11) 
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Fig 1: (a) A conditional independence graph with m = 4 nodes, (b) rSME solution path for Gaussian 
graphical modeling (m = 4, n = 12). 


3.4- Tuning 

A number of methods have been proposed for selecting the regularization parameter A in £i 
penalization methods and can be applied in our context. On the one hand, a predictive assess¬ 
ment as in cross-validation can be considered, but the selected graphs are typically too dense. 
Other possibilities include generalized cross validation (GCV) (Tibshirani, 1996), Akaike’s In¬ 
formation Criterion (AIC), approaches based on stability under resampling (Meinshausen and 
Biihlmann, 2010; Shah and Samworth, 2013; Liu, Roeder and Wasserman, 2010), the Bayesian 
Information Criterion (BIC) (Schwarz, 1978) as well as extensions of BIC proposed to cope 
with large model spaces (Chen and Chen, 2008; Cao et ah, 2012; Foygel and Drton, 2010b; 
Barber and Drton, 2015). The latter come with some consistency guarantees. 

As a demonstration, for the Caussian case from Example 1, we may consider an extended 
BIC criterion based on the basic score matching loss (2.2), defined as 

BIC(A) = -2tr(K^)-htr(K^K^W)-h |A^|logn-h4|A^|7logTO, (3.12) 

where = {(j, k) : ^ 0,j < k} and 7 is typically taken to be 1/2 or 1. Alternatively, we 

could refit, that is, replace K'*' by an unregularized SME computed in the submodel given by 
constraining all Kjk with (j, k) ^ to be zero. In either case, we choose A to minimize (3.12). 

4. Numerical Experiments 

We perform numerical experiments comparing regularized score matching to existing methods 
when data is simulated from (i) a multivariate normal distribution, (ii) a multivariate trun¬ 
cated normal distribution, and (iii) a distribution with normal conditionals. The comparison 
is made against three methods for estimation of Caussian graphical models, namely, glasso, 
neighborhood selection (both implemented in the R packages huge) and SPACE (in its CON¬ 
CORD formulation, with R package gconcord). In addition, we consider the nonparanormal 
SKEPTIC, which applies glasso to a matrix of rank correlations (Kendall’s r or Spearman’s 
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p) and can be motivated by a Gaussian copula model (Liu et al., 2012). We utilize the version 
based on Kendall’s t. Finally, we compare to SPACEJAM (Voorman, Shojaie and Witten, 
2014), which is based on additive modeling of conditional means and implemented in the R 
package space jaun. We conclude this section with brief investigations on the robustness of reg¬ 
ularized score matching when data is not generated under the assumed model. All results in 
this section are based on averaging over 100 independently generated datasets. 

4 . 1 . Gaussian data 

We consider a graph with m = 1000 nodes, composed of 10 connected components, each 100 
nodes in size and structured as a 10 x 10 2-D lattice (4 nearest neighbors). Each connected 
component also features three hubs with node degree 20, randomly selected from the subset of 
nodes in the component. 

We follow a procedure similar to the one from Peng et al. (2009) to convert the adjacency 
matrix of the graph into a sparse diagonally dominant partial correlation matrix. For each non¬ 
zero element of the adjacency matrix, we sample a draw from a uniform distribution on [0.5,1]. 
Each row of this new matrix is then rescaled by 1.5 times the sum of the absolute values of the 
off-diagonal entries in the row. We average this matrix with its transpose to ensure symmetry, 
and set its diagonal elements to 1. This matrix is inverted and converted into a correlation 
matrix to form S*. 

Data is then generated from a multivariate normal distribution with mean zero and a co- 
variance matrix S*. We choose sample size n = 600 and 1000. The setup agrees with that in 
Peng et al. (2009), except that the number of nodes has been scaled up. 

Figure 2 shows the ROC curves obtained under both sample sizes. Since the truth is Gaus¬ 
sian, we do not report results for SKEPTIC or SPACEJAM. For both sample sizes, the curve 
for regularized score matching almost perfectly aligns with those for neighborhood selection, 
SPACE, and glasso. The results indicate that regularized score matching estimators achieves 
state-of-the-art statistical efficiency in Gaussian models. 

4-2. Non-negative Gaussian data 

Glasso, SPACE, neighborhood selection and SKEPTIC all presume some form of underlying 
Gaussianity. In this and the next subsection, we demonstrate the application of regularized 
score matching in scenarios where these assumptions do not hold to highlight the versatility of 
the proposed appraoch. 

Similar to the Gaussian setting, we consider a graph with m = 100 nodes, composed of 
10 disconnected subgraphs with equal number of nodes. Using the lower triangular elements 
adjacency matrix of each 10 node subgraph, we construct ten matrices, where in each matrix, 
the element is drawn independently to be 0 with probability 0.2, and from a uniform distribu¬ 
tion on [0.5,1] with probability 0.8. The matrices, after symmetrization, are combined into a 
100 X 100 block matrix. The diagonal elements are set to a common positive number such that 
the minimum eigenvalue is 0.1 to form the precision matrix of the pre-truncated normal, K*. 

Data was then generated from a truncated centered multivariate normal, left-truncated at 0 
and with S* = (K*)“^ as normal covariance. We used the Gibbs sampler from the tmvtnorm 
package in R with a burnin period of 100 samples. We thinned out the remaining samples, 
keeping one in ten. The sample size n is taken to be either 2500 or 5000. The need for a larger 
sample size is explained by our theoretical findings in Section 6, specifically Corollary 2. 

The ROC curves are shown in Figure 3, where regularized score matching outperforms all 
competitors considered. The closest competitor to regularized score matching are SKEPTIC 
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(a) n = 600 (b) n = 1000 

Fig 2: ROC curves for the Gaussian case. The dashed grey line represents random selection of edges. 

The color to method correspondence is as follows: regularized score matching ( - ), neighborhood 

selection ( ), glasso ( - ), and SPACE ( - ). The curves are almost perfectly aligned. 


and SPACEJAM, both of which, objectively, perform well, being capable of capturing some of 
the non-Gaussianity in the data. 

We emphasize that here score matching was applied in its non-negative version from Sec¬ 
tion 2.2. The basic score matching procedure from Section 2.1 is far less efficient based on 
experiments not reported here. 

4-3. Normal conditionals 

Next, we take the data-generating distribution to have a density from the class 

{ m ml 

X! + X! + X! p 3: e M"*, (4.1) 

lAfc i=i J 

where B = {/3jfc} is a symmetric matrix with diagonal entries 0. This family is a special case 
of the distributions with normal conditionals from Example 3. 

We consider the case m = 625, with the graph being a 25 x 25 2-D lattice (4 nearest 
neighbors). The true interaction matrix B* is constructed by multiplying the adjacency matrix 
by —1/25. The coefficients for the terms a;| are all set equal to —1 and those for the Xj all 
equal to 8/50, which makes the marginal distributions deviate noticeably from Gaussianity. 
Data can be generated by Gibbs sampling using the Gaussian full conditionals. We discard the 
first 100 samples and thin out the remaining samples, keeping one in ten, as in Section 4.2. 

We plot the ROG curves for conditional normal data in Figure 4. Regularized score matching 
outperforms its competitors by a clear margin. This is not surprising, as both glasso and 
SPAGE are derived under normality. A Gaussian copula model as underlying SKEPTIG is of 
little help. SPAGEJAM does best among the competitors but cannot fully extract the available 
signal about the edge structure as the conditional means are non-additive and the conditional 
variances are not constant. 
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(a) n = 2500 (b) n = 5000 

Fig 3: ROC curves for the non-negative Gaussian case. The dashed line represents random selection 

of edges. The color to method correspondence is as follows: regularized score matching ( - ), glasso 

( - ), SPACE ( - ), SKEPTIC ( - ), and SPACEJAM ( - ). 

4 .4- A robustness check 

It is of interest to see how score matching performs when the data-generating mechanism is 
misspecified. We consider two scenarios. First, we apply the Gaussian score matching to a 
contaminated Gaussian setting similar to that explored in Finegold and Drton (2011). That 
is, a random subset of Gaussian observations is replaced with Gaussian noise. In the second 
example, we investigate the performance of the regularized Gaussian score matching when the 
observations are not Gaussian but rather drawn from a multivariate t-distribution. 

4.4- 1- Contaminated Gaussians 

We mimic the setup used in the numerical experiments in Finegold and Drton (2011), who 
consider these settings to test the robustness of their tlasso. Fixing m = 200, we construct a 
sparse precision matrix K* according to the following steps: (1) choose each (strictly) lower 
triangular element of K* to be independently -1, 0, 1 with probability 0.01, 0.98 and 0.01 
respectively, (2) symmetrize the matrix (3) for each row, i.e. for j = 1,... ,m, set = 1-1- 
IIK* II0 where k* _j refers to the jth row of K* with the diagonal element in that row removed. 
To strengthen partial correlations, the diagonal elements are scaled down by a common positive 
factor such that the minimum eigenvalue of the resulting matrix is approximately 0.6 (close to 
0.62 in our setup). The covariance matrix S* is obtained by inverting K*. 

We generate either n = 150 or n = 200 observations from a multivariate normal distribution 
with mean zero and a covariance matrix S*. We then corrupt 2% of the observations, substitut¬ 
ing them with i.i.d. N{0, 0.2) draws. The corrupted observations cannot easily be differentiated 
from normal observations, and this elevates the difficulty of the estimation problem. 

We present the ROG curves in Figure 5. Interestingly, score matching performs reasonably 
well, on par with SKEPTIC and neighborhood selection. For both sample sizes, the differences, 
which are subtle, are most apparent in the regime where the number of false positives detected 
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Fig 4: ROC curves for the normal conditionals case. The dashed line represents random selection of 

edges. The color to method correspondence is as follows: regularized score matching ( - ), glasso 

( - ), SPACE ( - ), SKEPTIC ( - ), and SPACEJAM ( - ). The curve for glasso overlaps with 

the curve for SPACE. 


is small: score matching falls slightly short of neighborhood selection, but it also appears to 
slightly outperform SKEPTIC. Surprisingly, there is a clear margin of difference between the 
performances of regularized score matching and SPACE, the former outperforming the latter, 
despite their noted structural similarities. Glasso, which utilizes the full Gaussian likelihood, 
performs the worst. Overall, we conclude that regularized score matching is competitively 
robust when compared to its alternatives in the contaminated Gaussian setting. 


Multivariate t-distributed observations 

In this section, we apply regularized Gaussian score matching to observations arising from a 
multivariate t-distribution with mean 0 and covariance matrix S*. This corresponds to testing 
the robustness of regularized score matching under model misspecification. Like in the previous 
section, we consider the case when m = 200. To set up S*, we construct a m x m adjacency 
matrix based on an Erdds-Renyi graph with the probability of drawing an edge between any 
two arbitrary nodes set to 0.01. We then convert the adjacency matrix into S* using the 
same procedure as in Section 4.1. Samples were drawn from a multivariate t-distribution with 
covariance matrix S* and three degrees of freedom. 

The ROG curves are plotted in Figure 6 for n = 100 and n = 150. As expected, SKEP- 
TIG outperforms all others, owing to its flexibility to accommodate outliers, as previously 
demonstrated in Liu et al. (2012). In fact, for elliptical distributions, such as the multivariate 
t-distribution, Kendall’s r allows for consistent estimation of S*, so SKEPTIG should perform 
optimally (Liu, Han and Zhang, 2012). Nonetheless, regularized score matching is reasonably 
robust under this setting: its performance is comparable to that of SPAGEJAM - only falling 
slightly short - SPAGE, and neighborhood selection. Again, glasso yields the poorest results. 
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(a) n = 150 (b) n = 200 

Fig 5: ROC curves for the contaminated Gaussian case. The dashed line represents random selection 
of edges. The color to method correspondence is as follows: regularized score matching ( - ), neigh¬ 
borhood selection ( ), glasso ( - ), SPACE ( - ), SKEPTIC ( - ), and SPACEJAM ( - ). 


5. Application to RNAseq Data 

The American Cancer Society estimates that in 2015 there will be 220,800 new cases of prostate 
cancer and 27,540 deaths. To understand how the cancer develops, as well as how it may be 
treated, it is necessary to decipher the genetic machinery which drives it. Since cancer is such 
a complex disease, it is insufficient to study a single gene at a time, as genes may interact with 
one another in many ways. Graphical modeling of gene expression data has the potential to 
aid in discovery of such interactions. 

RNAseq data from next-generation sequencing technology can be used to identify genes that 
are activated/transcribed or suppressed at the time of measurement. However, RNAseq data 
are non-negative and have skewed marginals, which presents a challenge for existing method¬ 
ologies. Graphical models based on truncated Gaussian models are interesting alternatives to 
existing approaches that primarily consist of applying Gaussian methods after transformations. 
Whether truncation models are truly useful scientifically deserves a fuller exploration; here we 
simply illustrate how different estimates can be obtained from the proposed methodology. 

Our case study is based on the RNAseq data from 487 prostate adenocarcinoma samples 
available in The Cancer Genome Atlas dataset. We focus on 350 genes that belong to “known” 
cancer pathways in the Kyoto Encyclopedia of Genes and Genomes. Removing genes with more 
than 10% missing values, we obtained a dataset with m = 333 genes. Remaining missing values 
were simply set to zero, adding to the challenge. (We will comment on the issue of missing data 
in the discussion.) In illustration of the regularized score matching methodology, we consider 
an exponential family of truncated normal distributions with density 

q{x\^,'K) (x exp , a: e K™. 

This generalizes the family of distributions considered in Example 2 by allowing the truncated 
normal distribution to have nonzero mean. 
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(a) n = 100 (b) n = 150 

Fig 6: ROC curves for the t-distributed case. The dashed line represents random selection of edges. 

The color to method correspondence is as follows: regularized score matching ( - ), neighborhood 

selection ( ), glasso ( - ), SPACE ( - ), SKEPTIC ( - ), and SPACEJAM ( - ). 


We compare regularized non-negative score matching, SPACE (using CONCORD formu¬ 
lation), glasso, SKEPTIC and SPACEJAM. We apply SPACE and glasso directly to the 
standardized data. We do not consider any marginal transformations as they are naturally 
accounted for when comparing to the rank correlation-based SKEPTIC. For each method, we 
tune the regularization parameter A in order to obtain \E\ = 333 (or 334) edges. Figure 7 
depicts the estimated networks, with isolated nodes removed, in layouts optimized for each 
graph. To allow for easier comparison, we also show the estimated networks in fixed layouts in 
Figure 8. Node degree distributions are plotted in Figure 9. 

By visual inspection, glasso and SKEPTIC give similar topologies, which can be explained 
by the fact that both are derived from the full Gaussian likelihood. Interestingly, we observe 
that SPACEJAM and SPACE likewise yield similar graphs, which reinforces findings from 
Shojaie and Sedaghat (2016). Regularized non-negative score matching yields a graph that is 
fairly different from the rest. 

While the usefulness of these models remains to be further explored, our case study demon¬ 
strates that regularized score matching can provide estimates that differ in interesting ways 
to the estimates generated by other methods. We compile a list of the top ten most highly 
connected genes in each of the estimated graphs in Table 1 (some lists have more than ten 
genes due to ties), as there is strong evidence that highly connected nodes play important roles 
in biological networks (Carter et ah, 2004; Jeong et ah, 2001; Han et ah, 2004). There are slight 
overlaps between the lists. Upon further inspection, we observe that six of the ten genes listed 
under regularized score matching have been previously linked to prostate cancer, five of which 
have not been identified by the competing methods: 

• CCNE2 (cyclin E2): a protein which is required for transition of the Gi to S phase of 
the cell cycle, which determines cell division. Regulated by PTEN, a tumor suppressor, 
it is over-expressed in metastatic prostate tumor cells (Wu et ah, 2009). 

• BRCA2 (breast cancer 2): mutations in the BRCA2 gene have been associated with 
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(a) Reg. score matching 


(b) SPACE 


(c) SPACEJAM 



(e) SKEPTIC 


Fig 7: Topology of inferred networks of \E\ = 333 or 334 edges for all considered methods. The layout 
has been optimized for each graph. Isolated nodes are not shown. Red colored nodes have degree 
greater or equal to 10. 



early-onset prostate cancer in men; men carrying mutations have a predisposition to more 
aggressive phenotypes (Gayther et ah, 2000; Mitra et ah, 2008; Tryggvadottir et ah, 2007; 
Fan et ah, 2006). 

• BIRC5 (survivin): a protein which prevents cell death, or apoptosis, and regulates cell 
division. Heightened expression has been found to be associated with higher hnal Gleason 
score, i.e., more aggressive cancer and worse prognosis (Kishi et ah, 2004; Shariat et ah, 
2004). 

• SKP2 (S-phase kinase-associated protein 2, E3 ubiquitin protein ligase): a positive regu¬ 
lator of the Gi to S phase of the cell cycle, which determines cell division. SKP2 labelling 
frequency in cancer was positively correlated with the Gleason score, and shown to be a 
significant predictor of reduced recurrence-free survival time after radical prostatectomy 
(Yang et ah, 2002; Wang et ah, 2008). It has been proposed elsewhere as a promising 
therapeutic target for prostate cancer (Wang et ah, 2012). 

• STAT5B (signal transducer and activator of transcription 5B): a transcription factor that 
encourages metastatic behavior of human prostate cancer cells. Its inhibition has been 
shown to induce apoptosis in human prostate cancer cells (Gu et ah, 2010; Ahonen et ah, 
2003; Moser et ah, 2012). 
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(a) Reg. score matching 




Fig 8: Topology of inferred networks of \E\ = 333 or 334 edges for all considered methods. Layout of 
nodes is fixed across graph estimates and was optimized for the SPACE estimate. Isolated nodes have 
now been inclnded. Red colored nodes have degree greater or eqnal to 10. 


Furthermore, via the Kolmogorov-Smirnov test, we fail to reject the hypothesis that the 
degrees of the nodes for the regularized score matching graph estimate follow a power law 
distribution, with significance level of 0.05. On the other hand, we reject this hypothesis for 
all other generated estimates at the same significance level. There is evidence that genetic 
networks are ‘scale-free’, which implies that their degree distribution can be approximated by 
a power law distribution (Albert, 2005; Barabasi and Albert, 1999; Jeong et ah, 2001). In this 
aspect, the topology of regularized score matching estimate is most similar to the hypothesized 
structure of gene networks. 

Finally, we would like to emphasize that we do not intend to claim that regularized score 
matching provides the best estimate of the underlying gene network, as the truth is unknown 
to us. What we can posit is that truncated Gaussian may be a useful model that provides 
potentially valid targets for therapy which may be missed by other methods. 

6. Theory 

This section establishes high-dimensional model selection consistency (sparsistency) of regular¬ 
ized score matching. We focus on pairwise interaction models as in (2.18), although our results 
could be extended to more general models. Theorem 1 below identifies general deterministic 
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Fig 9: Node degree distributions for inferred networks of |i5| = 333 or 334 edges for all considered 
methods. 


conditions on data that yield sparsistency of regularized (non-negative) score matching. Two 
subsequent corollaries make probabilistic statements about sparsistency in the Gaussian and 
the non-negative Gaussian case. Proofs are given in Section 7. Experiments that corroborate 
the theoretical findings are shown in Appendix G. 

Before stating the main results, we describe a key assumption for model selection consistency 
of £i-penalized estimators, the irrepresentability assumption, and highlight differences between 
various estimators of Gaussian graphical models with respect to this assumption. 

6.1. Setup and notation 

We consider a continuous pairwise interaction model as given by (2.18) with symmetric m x 
TO interaction matrix © = (dj/c). We let 9 = vec(©). Then the regularized score matching 
estimator, in its basic or non-negative version, is 

0 = argmin ^9^F{x.)9 + 9 + c(x) -|- A||0|ji. (6.1) 

9 L 

By Lemma 2, r(x) is a symmetric rv? x rr? matrix that is block-diagonal, with blocks of size 
TO X TO. For notational convenience, we drop the explicit reference to the data matrix x and 
denote r(x) and g(x) as F and g. 

The true data-generating distribution is assumed to belong to the considered model. We 
denote the true interaction matrix by 0* = {9*i,) and its vectorization by 9*. We define F* 
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Reg. score matching 

Glasso 

SKEPTIG 

SPAGE 

SPAGEJAM 

CCNE2 (19) 

EP300 (20) 

PIK3GA (23) 

TRAF6 (9) 

BHX (10) 

PIK3CG (16) 

SCSI (17) 

FZD7 (18) 

TPR (9) 

SOS2 (9) 

BRCA2 (13) 

BAD (16) 

PDGFRB (17) 

SOSl (9) 

TRAF6 (8) 

BIRC5 (12) 

TPR (13) 

TGFBR2 (16) 

JAKl (9) 

TGFBR2 (8) 

SKP2 (10) 

RBXl (13) 

TGEB2 (16) 

EP300 (9) 

SOSl (8) 

PIK3CD(10) 

PIK3GD (12) 

MMP2 (16) 

SOS2 (8) 

RRM2 (8) 

LAMB3 (10) 

LAMA4 (12) 

LAMA4 (16) 

EGFR (8) 

PDGFRB (8) 

STAT5B (9) 

HRAS (12) 

GLI2 (15) 

GBL (8) 

EP300 (8) 

HRAS (9) 

GLI2 (12) 

SOSl (14) 

BAX (8) 

PIK3GA (7) 

PDGFRB (8) 

GSTPl (8) 

TRAF6 (11) 
TGFBR2 (11) 
TGEB2 (11) 
SPIl (11) 

SOS2 (11) 
PDGFRB (11) 
MAP2K2 (11) 
APPLl (11) 

PDGFRA (14) 
MITF (14) 
EP300 (14) 

Table 1 

APPLl (8) 

ARNT (7) 


The most densely connected genes according to the estimated graphs generated via nonnegative regularized 
score matching, glasso, SKEPTIC, SPACE and SPACEJAM. The number in parenthesis corresponds to the 

estimated degree of the gene. 


and g* to be the expected values of T and g. The support of 9*, that is, 

S = S{9*) = {{3,k)-.j^k, 

is the edge set of the true conditional independence graph. Similarly, 

S=S{e) = {{j,k):j^k, 

determines the graph inferred by regularized score matching. Finally, we write d for the max¬ 
imum degree of the m nodes of the conditional independence graph. In other words, d is the 
maximum number of nonzero off-diagonal entries in any row (or column) of 0 *. 

6.2. Irrepresentability 

We say that the irrepresentability (or mutual incoherence) condition holds with incoherence 
parameter a if the following assumption holds. 

Assumption 1. There exists an a G (0,1] such that 

|||rscs(r*ss)-'IL <(!-«)• (6-2) 

Irrepresentability conditions play a key role in the analysis of £i regularization techniques 
(Biihlmann and van de Geer, 2011). For neighborhood selection in Gaussian graphical models, 
it has been formulated in terms of the covariance matrix S* (Meinshausen and Biihlmann, 
2006). In the theoretical analysis of the glasso, the constraint is placed on the Hessian of the 
log-determinant of the precision matrix K*, i.e., (K*)“^ 0 (K*)“^ (Ravikumar et ah, 2011). 

In order to highlight the differences in conditions required for sparsistency of glasso, neigh¬ 
borhood selection, SPAGE and regularized score matching, we revisit the Gaussian graphical 
model example in Meinshausen (2008). Let p G (0, l/-\/2), and let S = (uij) be the 4x4 
covariance matrix with ones along the diagonal, (T 23 = cr 32 = 0 , cri 4 = tT 4 i = 2 p^ and all other 
off-diagonal entries equal to p. The precision matrix K = (S)“^ then has ft:i 4 = K 41 = 0. The 
conditional independence graph G is as in Figure la. 
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Meinshausen showed that for samples drawn from A^(0, S), glasso can consistently recover G 
only if p < -^3/2 — 1 « 0.23. For neighborhood selection, the corresponding necessary condition 
is p < 0.5. If these conditions fail, then for large sample size, the probability of erroneously 
including the edge (1,4), i.e., P (ki 4 ^ 0) can be shown to be at least 0.5. It turns out that 
for regularized score matching, the analogous necessary condition gives a bound that falls in 
between 0.23 and 0.5, specifically, p < y/2 — 1 « 0.41. 

We observe that glasso, which yields positive definite estimates, requires the most stringent 
condition. When working with symmetric matrices as in regularized score matching, the con¬ 
dition is markedly relaxed. Allowing non-symmetric matrices in neighborhood selection leads 
to further relaxation of the condition. Interestingly, the pseudo-likelihood methods classified 
under SPACE have the same necessary condition as score matching. 

Assumption 1 should be seen as sufficient for consistency of regularized score matching. For 
Meinshausen’s example, it can be shown to amount to p < — 1) « 0.37. The analogous 

sufficient condition for glasso from Ravikumar et al. (2011) requires that p < ^{y/2— 1) « 0.21. 
For neighborhood selection, the condition is p < 0.5. 

6.3. Main Results 
We define 


cr* = lll(rss) ^lll^, and c®. = |||0*|||^. 

Moreover, let 

Ri = (r-r*), r2=g*-g, r3 = T*e*-g* 

such that the KKT conditions from (3.7) can be written as 

T*{e - e*) + Rid + r 2 + r 3 + Xz = 0, zed\\9\\i. 


(6.3) 


(6.4) 


(6.5) 

Theorem 1. Assume that Tgg is invertible and the irrepresentability condition holds with 
incoherence parameter a G (0,1] (Assumption 1). Furthermore, assume that 


IRlIloo < Cl, 


F2II00 < £ 2 , 


with dti < a/(6cr*). If 


X 3(2 - a) . , 

A> -max|c®*ei, £ 2 ), 


( 6 . 6 ) 

(6.7) 


then the following statements hold: 

(a) The rSME 9 is unique, has its support included in the true support (S C S), and satisfies 


(b) If 


\\e-o*\u<^x. 

2 — a 


..min > JiEM A, 

l<j<k<7n 2 — Q 


then S = S and sign(0^fc) = sign(0*j,) for all (j, k) € S. 
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Theorem 1 imposes deterministic conditions on the data, namely, the bounds in (6.6). In the 
following corollaries, we will consider specific distributional assumptions and impose population 
conditions that imply bounds of the form (6.6) with high probability. 

First, we provide a result for regularized score matching for the Gaussian case (Example 1), 
which has F = Imxm ® W with W being the sample covariance matrix, and g = vec(Imxm)- 
When the data is generated from a normal distribution with covariance matrix S* then F = 
Imxm G S* and, of course, g* = g = vec(I„xm)- 

Corollary 1. Suppose the data is generated from a normal distribution A^(0, S*) such that 
Fgg is invertible and irrepresentability holds for a G (0,1]. Let K* = (k*^,) = (S*)“^, 


c* = 3200 max (S),)^ 


and 


4 

Cl = -cr*. 
a 


Take any ti >2. If the sample size satisfies 

n > c*cld^ {log + log 4), 


( 6 . 8 ) 


and the regularization parameter is 


X > 


2ck* (2 — a) c* (log + log 4) 
a V n 


(6.9) 


then the following statements hold with probability 1 — 1/m 


Tl-2 . 


(a) The rSME K from (3.3) is unique, has its support included in the true support (S C S), 
and satisfies 


(b) If 


|lK-K*||o.<^A. 

2 — a 

1*1^ , 

mm bt.i. > -A, 

l<j<k<m ^ 2 — CX 


then S = S and sign(Kjfe) = sign(«;*j,) for all {j, k) G S. 

The corollary is proven in Appendix 7.2. Numerical experiments reported in Appendix C 
suggest that the sample size n indeed needs to scale at least ^{df logm) for sparsistency. 

From Theorem 1, we can also derive an analogous result for regularized non-negative score 
matching for the truncated Gaussian case (Example 2). The result requires the sample size to 
be larger than in the Gaussian case, due to the need to control higher order moments. Recall 
that here, F(x) a block diagonal m? x wf matrix, with the jth block given by 


1 

n 


E 

2=1 




Xi)T 


and g = 2w + where w = vec(W) and Wdiag = vec(diag(W)). 

Corollary 2. Suppose the data is generated from a non-negative Gaussian distribution with 
parameter K*, i.e., N(0, (K*)“^) is truncated to K™. Suppose further that Fjg is invertible 
and irrepresentability holds for a G (0,1]. Let 


c = max 


max Var[A^], 


L 

2 
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max Var[A|] 


and C 2 


6 

-Cr* 

a 









( 6 . 10 ) 


where L > 0 is an absolute constant. Take any T 2 > S. If the sample size satisfies 

n > c **+ log2)®, 


and the regularization parameter is 


^ 3(2-a) , 

A > -maxjcK*, 1| 


a 


c** (log +log2)® 


( 6 . 11 ) 


then the following statements hold with probability 1 — ■ 

fa) The rSME'K+ based on penalizing (2.26) with A|jK||i_„j is unique, has its support included 
in the true support (S C S), and satisfies 


|K+ -K* 


< 


cr- 

2-a 


A. 


(b) If 


mm \K 

l<j <.k<m 


'jk 


>^A, 


then S = S and sign((K+)j 7 j) = sign(K*^.) for all {j, k) € S. 


The proof of the corollary, which is given in Section 7.3, uses general tail bounds that apply 
to log-concave measures. The lower bound for n given in (6.10) could well be suboptimal 
and a lower power of logm may be sufficient for sparsistency. However, the experiments in 
Appendix C suggest that the exponent for logm cannot be taken too much smaller than 8. 

We also compared the lower bound we obtained for the non-negative Gaussian case to 
a result implied by the work of Yang et al. (2013) who treat consistency of neighborhood 
selection in a general framework that allows node-wise conditional distributions to arise from 
exponential families. Interestingly, when working out what their general theorem would say 
about the above non-negative Gaussian model we found that the sample size n would also be 
required to be at least n(d^(logm)®). Our result from Corollary 2 is thus at least comparable 
to existing results in the literature. 


7. Proofs 


7.1. Proof of Theorem 1 


First, we note that claim (b) is an immediate consequence of claim (a). To show (a), we apply 
the primal-dual witness method (PDW) from Wainwright (2009). As explained in detail below, 
PDW entails construction of a pair (0, z), with 9 G K™ and z S 9||6>||i, that satisfies the KKT 
optimality conditions from (6.5) and has the support of 9 included in S. If the construction 
is successful then it ensures that the rSME problem admits a unique solution such that the 
rSME 9 is equal to 9 and inherits all the properties the latter has by definition. These properties 
include the bound on estimation error in addition to the claim about the support. 

Replacing P by P* and g by g* in the empirical (basic or non-negative) score matching loss 
recovers the population loss which, in the present exponential family context, is quadratic and 
minimized when 9 — 9*. (Recall that the score matching loss is consistent.) It follows that r^ 
from (6.4) is zero as it is the gradient of the population loss. In block form, (6.5) becomes 


^ SS ^ SS‘= 


'0s-0*s- 

-r-i* 


_9s^ - 9go_ 


R-i,ss 

R-i,S‘=s 



Ss 

+ 

r2.s 

+ \ 

zs 


o' 


9s'=_ 


f2,S<=_ 


_ZS'=_ 


0 


(7.1) 


We construct the PDW pair {9,z) according to the following steps: 

26 






















(i) Take 9 to be the unique solution to the support-restricted problem, that is, 

0 = arg min -O'^TO — 9 + \\\9\\i. 

Ggc—O 2 


(7.2) 


(ii) Choose 
(hi) Solving (7.1), set 
zs- 


zs e i9||0s||i. 

-T*s.s(X*ss)-^ {^l,SS0S + r2,s) 

+ R-1,S=S^S + + Argc5(rg5) ^Zs 


(7.3) 


(iv) Check the strict dual feasibility condition that 

Pscll,, < 1. (7.4) 

By step (i), 9 has support contained in S. By step (hi), {9,z) is guaranteed to fulfill the 
equations from (7.1). By step (ii), the 5'-coordinates of 5 satisfy ‘their part’ of the subgradient 
condition. Thus, if the strict dual feasibility from step (iv) holds, then {9, z) satisfies the KKT 
conditions from (6.5). Having a strict inequality in (7.4) ensures that every solution to the 
original rSME problem has support contained in the true support S and since is assumed 
invertible, there is then only one solution (Wainwright, 2009, Lemma 1). The invertibility of 
Fgg is also what guarantees the uniqueness in step (i). 

If the PDW construction is successful, that is, if the strict dual feasibility condition can be 
established, then we may conclude the rSME 9 possesses all the desired properties. Indeed, 9 
equals 9 which has these properties by construction. 

Let A — 9 — 9*, where 9 is the solution to the support-restricted regularized score matching 
problem from (7.2). By definition, ||A||oo = ||As|joo. Furthermore, by step (iii) in the PDW 
construction. 


zs- 


r'scs(r33 ) ^(Ri_ 5 s (05 -I- As) -I- r2,s) - R-l,S<=s(^'s + ^s) - r2,S- 

+ r'scs(rss) (7.5) 


By Assumption 1, and the triangle inequality for the £oo norm. 


|5s=lloo<^ 


(1 - a) (||Ri_ss(6's + As)||oo + lk 2 .s||oo) 


+ ||Rl.S‘^s(^S + As)||oo + \\r2,S- 


< 


< 


(2-a) 

A 

(2-a) 

A 

( 2 -«) 

A 


|R-1.'S(0S + As)||oo + Ik 2 || 

|Rir H-Ri,.sAs||oo + ||r2 
, (2-a) 


+ (1 - a) 


+ (1 - a) 
-f (1 - a) 


|Ri6» 


A 


Ri,.5|ILI|As||oo + ^^T^|k 2 ||oo +(1 - a), 


=Gi 


= G2 


= G3 
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where the equality in the second to last line follows from the fact that = 0 . 
We observe that 


G. = l^x 


|e:,a.vec(Ri, 


blocks J OO 


(7.6) 


where 


® H = 
Wide 




0f 


0 


0 

n*T 


is an x matrix whose diagonal blocks are given by the rows of the the interaction matrix 
©*, each row being replicated m times. Moreover, vec(Ri^bi„ei,=) refers to the vectorization of 
the m diagonal blocks of Ri that are each of size m x m; recall Lemma 2. More precisely, if 
Rin,. •., Ri,m are the diagonal blocks of Ri, then vec(Ri_biockEi) is obtained by concatenating 
vec(Riy),..., vec(Ri^m) in that order. Equation (7.6) is the only argument relying on the 
block-diagonality of F and Ri. 

From (7.6), we obtain that 


Gi < 

A 


.elllool|vec(Ri)||oo<^\^|||0: 


,ei. 


I©* 


since we have assumed that ||vec(Ri)||oo = ||Ri||oo < ei- By construction, 

III©* lll^ = C 0 *. It follows, from our choice of A that Gi < a/3. 

By the assumption that ||r’ 2 ||oo < ^ 2 , we have 

G3 < < 3 , 

and it remains to similarly bound G 2 . We treat |||Ri_. 5 ||| and ||As||oo separately. 

We note that the rows of Ri,.s have at most d non-zero elements. It follows that |||Ri^.s|||^ < 
d||Ri||oo < dei < a/ 6 cr, where the last inequality holds by assumption. Since Fgs is assumed 
invertible, we have from the top block of equations in (7.1) that 

As = (rss)”^(-Ri,ss^s “ ^^)- 

Note that by assumption, Fss is invertible. We obtain that 


iSiloo ^ 


< 

||(rs5) 

< 

||(rs5) 

< 

||(rs5) 


-11 


-11 


|Rl,SS^slloo + |k2||oo + A 


I©,1. 


||vec(Ri)||oo + lk 2 ||oo + A 


X ’(-“( a. 

°° 3(2-a) 

Since ||Ri||oo < ei, we have |||Ri,ss||loo < d^i < 1/cr*- This implies that 


(7.7) 


* \-l| 


(r^s)-^Ri,s5|L< IIKr^s) 
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IIR-i.ssllL < 


00 













which gives us the following bound in the error in the inverse in the matrix norm, 


* \-i| 


(r55)-' - {r*ss) 


< 




* N-l| 


< 


ms) 


l|l^l,ss|| 


l-|ll(ns)-'lllooll|Rl,Ss| 


X lll(Vss)- 


Application of the triangle inequality, along with our definition of cr* = |||(r 55 ) 


. yields 


|(r55)- 


< IIKrssr'l 
= lll(rssr'| 

Cr* 


+ |||(r5s)-'-(ns)- 

1 


i-lll(r*ss)-'llloolllRi,55ll 


1 — dcr*ei 
Cr* 

1 — a/6’ 


(7.8) 


where the last inequality uses the assumption that dei < a/6cr*. Substituting (7.8) into (7.7), 
it is straightforward to show that G 2 < a/3. Therefore, Gi + G '2 + Gs < a, which yields that 

Ps“ll < 1- 

Along the way we have also proven the second part of the claim. Indeed, from (7.7) and 
(7.8), we have 


ll^slloo < 


Cr* (6 — a) 2cr* A 

1 — a/6 3(2 — a) 2 —a 


7.2. Proof of Corollary 1 


We need to show that the conditions in Theorem 1, specifically those in (6.6), hold with the 
claimed probability. Since r 2 = g — g* = vec(Imxm) — vec(Imxm) = 0, the second inequality 
in (6.6) can be trivially satisfied with any e 2 > 0. Thus, we only need to show that we can 
bound IjRilloo by some suitable ei with sufficiently large probability. To do so, we apply a 
Bernstein-type concentration inequality for the entries of W that is also used by Ravikumar 
et al. (2011). Lemma B.l below states the inequality, as given in their paper. 

The matrix Ri features only entries in W — S*. By taking a union bound over the 
entries of W, plugging in our lower bound for n and observing that cr = 1 in the Gaussian 
case. Lemma B.l yields that 


Pr 


!R 


1 llcJO ^ 


c* (log m’"! -I- log 4) 


In addition, each row in |||R.s|||oo features at most d entries from the matrix W — S*. Hence, 
it follows from another union bound, and choosing n at least 

c*c\d^{\ogrrd'^ -I-log4) 


where c* and ci are defined in the corollary statement, that 


Pr 




< 


1 

.^Ti -2 
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Thus, applying Theorem 1 with 


/ c* (log + log 4) 

V n 

shows that our choices for A and n give the high probability statement in Corollary 1 . 

When looking back at the proof of Theorem 1, we see that as a consequence of having r 2 = 0, 
we need only be concerned with bounding terms Gi and G 2 - We may thus bound Gi and G 2 
each by a/2 instead of a/3 and ignore the G 3 term entirely, as it is 0. This leads us to having 
Cl = (4/a)cr*, as opposed to the expected ( 6 /a)cr*- 


7.3. Proof of Corollary 2 


We proceed as for the proof of Corollary 1 and use concentration results to satisfy the bounds 
from ( 6 . 6 ) in Theorem 1. However, we now bound ||Ri|joo and ||r 2 ||oo using concentration 
inequalities for general log-concave measures (any truncated multivariate normal density is 
log-concave). 

Let = (Xii,... ,Xim) be i.i.d. according to N{0, (K*)“^) with truncation to K™. Take 


ei 


£2 


[(f) (logm^=^ +log 2 )]' 

y/n 

[(f) (logm’’^ -hlog 2 )]' 


max Var[Wf], 


max Var[W|]. 


(7.9) 

(7.10) 


From Lemma B.3 below, we know that for the absolute constant L specified in Lemma B.2, 
we have, 


Pr 


-Y,X,,X,kX^,- E[X,XkXj 


Pr 




2 = 1 


> £1 

< exp 1 

> £2 

< exp 1 


y/nei 


L V /max Xav[XjXkXj\ 

V lAA 

) 


y/ne2 ^ ■ 
/max Var[XjWfe], 

V 3,k,l 


for all j,k,£ = l,...,m. By a union bound over no more than 2m^ events, we have both 
IIR-i ||oo < £1 and ||r 2 ||oo < £2 with probability at least 1 — Y/rrC^ ^ as m ^ 00 . Applying 
Theorem 1 with the chosen ei and 62 thus shows that our choices for A and n lead to the claim 
in Corollary 2. 


8. Discussion 

This paper proposes the use of regularized score matching for estimation of conditional in¬ 
dependence graphs in high dimensions. The focus is on modifying the score matching loss of 
Hyvarinen (2005) with an £i penalty to accommodate underlying sparsity, which is in the spirit 
of popular existing methods such as glasso and neighborhood selection. This said, any other 
regularization scheme can be considered instead. For instance, the method from Defazio and 
Caetano (2012) can be applied to encourage hub structure in the inferred graph. 

Our study of the Gaussian example of Meinshausen (2008) suggests that £ 1 -regularized score 
matching falls in between neighborhood selection and glasso in terms of conditions for required 

30 





















Algorithm 1 

1: Initialize 0 = 0 

2 : Initialize S = arg max (r(x)0 + ^(x)) . 

3 \ ^ 

3: Initialize = —sign ((r(x)0 + p(x))^) 

4: Initialize = 0 

5: while ||r(x)0 + 0 (x)||^ > 0 and Hgg is invertible do 

6 : r?i <- min {?7 > 0 : |(r(x) 6 » + g(x)|j = |(r(x)e + g(x)|g, j ^ S}. 

7: ri 2 ■(- min {?7 > 0 : ( 6 » + rj$,)j = 0, j S S}. 

8 : )? <-min{77i,»72}- 

9 : e ^e + r)i 

10 : if ?7 = ? 7 i then 

11: Add variable that attains equality to S. 

12 : else 

13: Remove variable that attains 0 from S. 

14: end if 

15: Cg ^ (r(x)_§^)“^sign( 6 »_§) 

16: end while 


for graph selection consistency. Here, the glasso requires the most stringent conditions, and the 
score matching approach appears to be similar to pseudo-likelihood methods that work with 
symmetric estimates of precision matrices, such as SPACE (Peng et ah, 2009) and subsequent 
reformulations such as CONCORD (Khare, Oh and Rajaratnam, 2015). However, regularized 
score matching is particularly convenient in that the score matching loss is a quadratic function, 
even for non-Gaussian exponential families. This brings about piecewise linear solution paths 
and allows for a simple theoretical analysis. We anticipate that the simple structure of score 
matching will lead to further advances in graphical modeling, such as computationally efficient 
techniques to deal with corrupted or missing data, in the spirit of Loh and Wainwright (2012), 
or new methods to tune regularization parameters, as in Chichignoud, Lederer and Wainwright 
(2014). 

Regularized score matching is an interesting method for Gaussian models, as we showed 
empirically and theoretically. In particular, for consistency (under the usual irrepresentability 
conditions), the sample n must be on the order Q.{(P\ogm), which matches the conditions for 
the existing methods mentioned above. However, as our simulation study shows, regularized 
score matching really shines in the context of non-Gaussian models, where it eliminates the 
need to deal with computationally intractable normalization constants in a way that the loss 
continues to be a quadratic function of parameters. This opens a lot of new possibilities for 
graphical modeling such as the truncated normal model we applied to RNAseq data. 

Score matching applies to continuous data. While Hyvarinen (2007) discusses a ratio match¬ 
ing method for discrete data, it is not as computationally convenient as its continuous counter¬ 
part. A different approach of adding Gaussian noise to discrete data was proposed for imaging 
problems by Kingma and LeCun (2010). Exploring the merits of their approach for graphical 
modeling, and supplying supporting theory, would be an interesting problem for future work. 


Appendix A: Implementation 

The piecewise linear solution path for regularized score matching can be computed using Al¬ 
gorithm 1, which is an adaptation of the LARS-Lasso algorithm for linear regression (Efron 
et ah, 2004). It is also a special case of the algorithm found in Rosset and Zhu (2007). In our 
pseudocode, S is the current active set, i.e., S = {j : 9^ ^ 0} for the currently relevant value 
of the regularization parameter A. 
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Algorithm 2 


Input: Initial estimate 

Input: tmax, maximum number of iterations 
Input: e, the maximal tolerance level 


1 

2 

3 

4 

5 

6 : 

7 

8 
9 

10 


Initialize t ■<— 1 

Initialize C <— e + 1 (C stands for convergence criteria) 
while C > e or t < tmax do 
0(t) <j_ 0(t-l) 

for j 1, 2,.. ., s do 


0' 


(t) 


Soft 


r(x) 


A 

r(x),. 


end for 

c t- ||0W - 

t <-1 +1 

end while 


In the Gaussian and truncated Gaussian case, the algorithm stops when the active set has 
size IS"! = min{n,TO}m. For larger active sets the matrix is not invertible. Finding the 
step size in Algorithm 1 requires O (min{n,m}m) operations, while the inversion step is at its 
worst 0(|5p) = O (min{n, . Overall, the complexity of Algorithm 1 can be found to 

be O (min{n, the heaviest cost comes from the matrix inversion step. 

For large-scale problems, LARS-type algorithms may be slow and coordinate-descent meth¬ 
ods are popular alternatives (see e.g. Friedman et ah, 2007). Algorithm 2 describes a coordinate- 
descent algorithm to minimize the regularized score matching objective from (3.1). It entails 
updating one coordinate, or one element in the parameter vector/matrix, such that it mini¬ 
mizes the objective function while holding all others as constant, until a convergence criterion 
is satisfied. Results in Tseng (2001) ensure convergence of Algorithm 2. 

Example 1 (cont.). For the Gaussian case, the coordinate descent procedure alternates be¬ 
tween updating the diagonal entries and off-diagonal entries, by manipulating the estimating 
equations (3.10) and (3.11) accordingly. The updates are of the form 

(t+i) , 

(t-l-l) (t-fl) Q f. f ~J2k'^k'0’jk"^Vk 2A \ 

« /C : 4,- ^ Soft —— ^, 

■> ^ y Wjj + Wkk Wjj -I- Wkk J 


for j,k G {l,...,m}. The computational complexity of this scheme can be shown to be 
mm{0{nm'^),0{m^)), which is the same as for the methods classified under SPAGE; the com¬ 
plexity of glasso is 0{m^). We do not prove this fact, as it follows directly from reasoning 
elaborated on in Khare, Oh and Rajaratnam (2015). 


Appendix B: Concentration results 

Corollaries 1 and 2 make use of the following concentration results. The first lemma is used 
to prove Corollary 1 while the latter two (one is derived from the other) are used to prove 
Corollary 2. 

Lemma B.l (Ravikumar et ah, 2011). If (Ai,...,Am) is a zero-mean random vector with 
covariance matrix S* such that Xijis sub-Gaussian with scale parameter a, then the 
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sample covariance matrix W, for n i.i.d. samples, satisfies the bound 


Pr[|W,fc - > 5] < 4 exp 


nS'^ 

128(1 + 4(t2)2 (S* )2 


(B.l) 


for any fixed choice of two indices 1 < j,k < m and for all S € (0,40 max S* ■). 

Lemma B.2 (Carbery and Wright, 2001). Let X be a Banach space, and let f : M™ -P- X be 
a polynomial of degree at most z. Suppose 0 < Ci < C 2 < oo and p is a log-concave probability 
measure on K™. Then 


(^j\\f{x)f^/^dp{x)Y‘'^ < (/ ’ (B.2) 

where L > 0 is an absolute constant. 

From this lemma we may derive the following concentration result. After proving the lemma, 
we comment on how it is used in the proof of Corollary 2. 

Lemma B.3. Consider a degree z polynomial f{X) = f{Xi,... ,Xm), where Xi,... ,Xm are 
possibly dependent random variables with log-concave joint distribution on K™. Let L > 0 be 
the constant from Lemma B.2. Then, for all 6 such that 


K:=i 


l/z 


L leVVar[/(X)]^ 


> 2 , 


have, 


l/z 


Pr[|/(X)-F;[/(X)]| ><5] < exp<^-- 


L \^VVar[/(A)]^ 

Proof. Choosing = 2z and C 2 = iFz in Lemma B.2, we have 


E[\f{X)-E[f{X)]\^]i < x/Var[/(A)]. 


Hence, by Markov’s inequality, for any 6 satisfying (B.3), 

E[\f{X)-E[fiX)]n 


P[\f{X)-E[f{X)]\>S] < 


< 


I^LKJ VVar[/(A)] 
exp{—K} 


K 


= exp 


1_) 

L \^Yar[f{X)]J 


and the proof is complete. 


(B.3) 

(B.4) 


(B.5) 

(B.6) 

(B.7) 

(B.8) 

□ 
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In the proof of Corollary 2, we apply Lemma B.3 with 5 = Ci from (7.9) and with 6 = 62 
from (7.10). It thus needs to be checked that condition (B.3) holds in these two cases. Indeed, 
the condition holds as long as 


m > exp 


2^6 - log 2 


T2 


(B.9) 


To see this, we substitute ei and €2 for 6 in (B.3), take z = A and 2 respectively, to find a term 
that is lower bounded by {t 2 logm + log2)/e^. Here, the l/^/n factor in ei and 62 cancels out 
with the Aj^/n term generated by the •\/Var[/(X)] term in the denominator. (Recall that in 
our scenario /(X) is an empirical average). The more stringent condition on m comes from £2 
and is stated in (B.9). Thus, if (B.9) holds, (B.3) is satisfied. Since T 2 > 3, the right-hand side 
of (B.9) never exceeds 


exp 


i(2Ve - log2)| 


< 3. 


Hence, in our application of Lemma B.3, the condition from (B.3) holds for m > 3. 


Appendix C: Experiments 

We perform experiments, similar to those found in related work, that give empirical support 
for Corollary 1. This corollary treats Gaussian graphical models for which the sample size n 
ought to be of order (P logm. We experiment by varying the number of variables to, the degree 
d, and the minimum signal strength. Following Ravikumar et al. (2011), we define the ‘model 
complexity’ to be 

C := -cr- X max S*,. (C.I) 

a j 

In addition, we investigate how the sample size n required for sparsistency for non-negative 
Gaussian graphical models needs to depend on to. All reported results are based on averaging 
over 100 trials. 


C. 1. Gaussian experiments 

We conduct our experiments using three graph structures: (a) a chain, (b) a 2-D lattice with 4 
nearest neighbors, and (c) a star. We consider (a) and (b) when varying the number of variables 
TO, in which case we vary the length of the chain and the number of nodes in the lattice. This 
keeps the degree d constant. The effect that d has on the sample complexity is investigated 
using stars. We let the regularization parameter A scale with -^/logTO/n, a choice corroborated 
by Corollary I. 


Dependence on number of nodes 

Consider first the case where the underlying conditional independence graph is a chain of length 
TO S {64,100, 225,375}. The degree d is always 2, and we choose the tridiagonal precision matrix 
K* to have entries = 0.3 if (j, k) G E and = 1 for j = 1,... to. Here, a, ck* and cr* 
are constant across all to. 

Figure 10 shows the probability of correct signed support recovery plotted against the sample 
size n, with different curves corresponding to different to. As expected, we see from Figure 
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(a) (b) 


Fig 10: Relative frequencies of signed support recovery for Gaussian observations with a conditional 
independence graph that is a chain of varying length m. Panels (a) and (b) differ only in the scaling 

of the x-axis. The colored lines correspond to m = 64 ( - ), m = 100 ( - ), m = 225 ( - ) and 

m = 375 ( - ). 


10(a) that successful support recovery requires n to grow with m. However, upon rescaling n 
by 1/logm, the curves overlap as seen in Figure 10(b). 

We repeat the experiment with the 2-D lattice graph with m G {64,100, 225} nodes. Each 
node is connected to four nearest neighbors such that the degree d is always 4. We choose 
K* with Kjj. = 0.2 for {j,k) G E and K*j = 1 for j = 1 ,...to. Again, a, ck* and cr* are 
constant across all m. The results are presented in Figure 11, which shows curves of recovery 
probabilities that stack on top of one another when n by 1/logm. 

We conclude that with C and d held constant, the sample size n needs to scale with logm 
for consistent signed support recovery. This is consistent with Corollary 1. 


Dependence on node degree 

We now fix the number of nodes to m = 200 and vary d. We consider a star graphs with varying 
hub node degree d G {15, 20, 25}. The precision matrix K* is chosen such that cr*^ = 2.5/d for 
(j, k) G E, and crT = 1 for j = 1,... m. Now, a, ck* and cr* are constant across all d. 

Figure 12 shows the probability of correct signed support recovery plotted against n. The 
left panel demonstrates that correct recovery is more difficult with increasing d. Larger n is 
needed to attain the same success rate. Upon rescaling n by 1/d^ in the right panel, the three 
curves align. This validates Corollary 1 in that for fixed m, a, ck» and cr*, the sample size n 
needs to scale with to ensure sign consistency. 


Dependence on ‘model complexity’ 

We return to the chain-structured graphs considered earlier in this section. This time, however, 
we fix m = 64 and d = 2 while changing the edge strengths k*/. for (j, k) G E, which alters C 
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(a) (b) 


Fig 11: Relative frequencies of signed support recovery for Gaussian observations whose conditional 
independence graph is a 4-nearest neighbor lattice with m nodes. Panels (a) and (b) differ only in the 

scaling of the x-axis. The colored lines correspond to m = 64 ( - ), m = 100 ( - ), and m = 225 

( - )• 


from (C.l). We plot the probability of correct signed support recovery against n for varying C. 
In the resulting Figure 13, the curves shift right as C becomes larger so a larger n is needed 
to attain the same probability of correct signed support recovery when C grows. This is again 
consistent with the implications of Corollary 1. We do not believe that the lower bound we 
found for n is sharp enough in terms of its dependence on a, ck* and cr* to determine the 
rescaling we must perform on n to align the curves. 

C.2. Non-negative Gaussian experiments 

Finally, we experiment with regularized non-negative score matching for normal observations 
truncated to the positive orthant. According to Corollary 2, a sample size of n = il((i^(logTO)®) 
is sufficient for signed support recovery. The aim of our experiments is to explore to what extent 
this scaling is necessary. Specifically, we will consider exponents other than 8 for logm. 

For our experiments, we revisit the chain-structured graphs from Section C.l and choose a 
triangular matrix K* with = 0.3 if (j, k) € E and and K*j = 1 for j = 1,... to. The degree 
d is fixed at 2 and we only vary to S {20,25, 30}. We let the regularization parameter A to scale 
with y/ (logTO)®/n. Figure 14 plots the probability of correct signed support recovery against 
n, with different curves for the different values of to. 

Panel (a) in Figure 14 illustrates that, larger n is needed account for larger to. The other 
three panels have the x-axis rescaled to n/(logTO)“ for exponents a G {6,7,8}. Panel (b) 
suggests that n scaling with (log to)® is not sufficient for support recovery. Comparing panels 
(c) and (d), (logm)® seems more than what is necessary. It thus appears that the scaling of 
the sample size we assumed in Corollary 2 is suboptimal but not drastically so. 
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(a) (b) 


Fig 12: Relative frequencies of signed support recovery for Gaussian observations whose conditional 
independence graph is a star with varying degree d. Panels (a) and (b) differ only in the scaling of the 
x-axis. The colored lines correspond to d = 10 ( - ), d = 15 ( - ), and d = 20 ( - ). 



n 


Fig 13: Relative frequencies of signed support recovery for Gaussian observations whose conditional 
independence graph is a chain of fixed length. The different cnrves correspond to different signal 

strength summarized in the model complexity C. The colored lines correspond to C = 857 ( - ), 

C = 668 ( - ), C = 576 ( - ) and C = 543 ( - .) 
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(a) 


(b) 



n/log(m)' 



(c) 


(d) 


Fig 14: Relative frequencies of signed support recovery for truncated Gaussian observations whose 
conditional independence graph is a chain of varying length m. The four panels differ only in the 

scaling of the a:-axis. The colored lines correspond to m = 20 ( - ), m = 25 ( - ), and m = 30 

(-)• 
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